Transitional Type Creation for Multi-Stage Data Transformation in XChange
Abstract
Many applications operate using a model of several sources of identically typed data all being sent to a collection of several sinks, each using different aspects of the data. The data volumes generated by these applications may be too large to allow transfer of a full copy of the data to each sink. Computational limitations may prevent the server from transforming the data into a reduced form suitable for transmission to each sink. The sinks themselves may be computationally bound and unable to perform the level of data manipulation required for their processing. In all of these instances, moving the data transformation operations into an overlay network along the paths between the sources and the sinks can yield better quality of service guarantees. This work demonstrates how to generate the intermediate type and transformation information for each of the intermediate nodes as the data flows from the source and transforms into the form required by each participating sink.

Introduction

Many scientific and technical applications fit the model of a single source or group of sources feeding several sinks simultaneously. For example, in molecular dynamics one view of the data could be a visualization of the molecular structure while a second view could be a histogram of molecular distances. For the first viewer, a complete data feed is necessary. For the second, a very limited summary form of the data is all that is required. The source may be at its computational or storage performance limits just providing a single copy of the data, unable to devote any more processing or resources to generating additional streams. The network bandwidth may be limited to a degree that two complete streams cannot be sent with the quality of service guarantees demanded. The second display may be a device with very limited bandwidth and limited computational power, such as a handheld. These considerations demand a different approach for transmitting and transforming the data into a usable form. By transforming the data in flight from the source to the sink, underutilized intermediate nodes can offload the processing requirements of the sources, the sinks, or both, and can reduce transmission loads by delaying the splitting of the data into multiple streams until much later in the path, reducing the network bandwidth required to meet tighter quality of service guarantees.

Another area where this technique can be employed is in the file systems of high performance systems. The access speeds of secondary storage may not be fast enough to keep sink applications provisioned with sufficient data. By moving the requirement of generating multiple streams off of the source system, better scaling can be achieved. The technique can also be used for modeling of materials, modeling stresses for airplane design, hydrology simulations, and other applications.

This paper describes the general architecture of the system as it pertains to the generation of intermediate types. First is a short architectural description of the types of system nodes. Second is a description of the process for generating the overlay network and distributing the processing. Issues related to propagating source elements are examined third, followed by issues related to propagating generated sink elements fourth. The fifth section describes some issues with generating the copy transforms for propagating elements. Sixth is a short discussion of how the system performs, what work remains, and what had already been built prior to this project.
The last is a discussion of related work.

Architecture

The system consists of three types of nodes:

1. Sources – These machines produce identically typed content for consumption. The initial work assumes only one source, but most of the tools and techniques generalize easily to multi-source configurations. One of these sources is designated the controller and manages the overlay.

2. Sinks – These are the consumer machines. Each of these nodes requests the source data, manipulates the data in some way, and generates some form of output.

3. Intermediates – These are the extra nodes in the network along the paths from the sources to the sinks where computation can be off-loaded.

Each of these nodes has different resource levels available. By taking advantage of knowledge of the multiple sinks' needs for data transformation and of the capabilities of the intermediate nodes, the intermediate nodes can be employed to offload both computational resources and network bandwidth load, providing tighter quality of service guarantees.

Generating the Overlay Network

In order to generate the overlay network that combines transformation operations, several activities must take place on the sinks and the source. The process below describes the steps necessary to generate this overlay:

1. For each sink, decompose the application such that the code transforming the source data into exactly the pieces of the source data required for the sink to operate is extracted. These extracted pieces of code will be termed a collection of transformation operations, as they describe how to transform the source data into what is required by this sink.

2. For each sink, create a transformation description. The currently supported description format uses XML, with a general-purpose intermediate representation capable of handling the compiled output from the created XML format and potentially other languages. The description consists of an XML Schema [3] description of both the source and sink types, an indicator of the root types for both the source and sink, and a list of transformation operations. These four components are the only pieces used, in order to keep the system portable. By avoiding any encoding of platform- or language-specific items, the format can be used more generally. (A sketch of such a description appears after this list.)

3. Each sink communicates with the controller source, sending its transformation description. The controller decomposes each set of transformation operations, generating a graph in which identical operations are merged into shared nodes and splits occur where computation is no longer shared. (A sketch of this merging also follows the list.) The operations are ordered according to an optimization procedure that emphasizes the characteristics most important to the system designer. This graph is mapped onto the available intermediate nodes to form the overlay network. The path from the source to each sink is capable of generating the proper data type and format required by that sink. The nodes placed earlier along the path will generally be the ones most common across sinks; branching occurs where the evolving types diverge.

4. The controller uses the assignment of transformation operations to intermediate nodes to generate the appropriate intermediate types, with the transformation operations needed to allow the data to flow and transform from the source to the sinks, arriving in the various proper formats.

5. The intermediate forms are distributed to all of the participating nodes, forming communication channels. The sinks are connected into this network and the data flow begins.
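The paper does not reproduce a concrete description file, so the following is a minimal sketch, in Python, of what the four components named in step 2 might look like and how they could be pulled apart. Every element and attribute name in the embedded XML (transformDescription, sourceSchema, sourceRoot, and so on) is an invented placeholder rather than the actual XChange syntax; only the four components themselves come from the text. The type names echo the molecular dynamics example from the introduction.

    # Hypothetical transformation description with the four components of
    # step 2: source and sink schema descriptions, the two root type
    # indicators, and a list of transformation operations.
    import xml.etree.ElementTree as ET

    DESCRIPTION = """
    <transformDescription>
      <sourceSchema><!-- XML Schema for the source types --></sourceSchema>
      <sinkSchema><!-- XML Schema for this sink's types --></sinkSchema>
      <sourceRoot type="MolecularFrame"/>
      <sinkRoot type="DistanceHistogram"/>
      <transforms>
        <transform name="computeDistances"
                   inputs="MolecularFrame.atoms"
                   output="DistanceHistogram.bins"/>
      </transforms>
    </transformDescription>
    """

    doc = ET.fromstring(DESCRIPTION)
    source_root = doc.find("sourceRoot").get("type")
    sink_root = doc.find("sinkRoot").get("type")
    transforms = doc.findall("transforms/transform")
    print(source_root, sink_root, len(transforms))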
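Step 3's merging of identical operations across sinks can be illustrated with a simple prefix-sharing tree. This is only a sketch under two assumptions the paper does not state: that each sink's operations arrive as an ordered list (the output of the optimization procedure) and that identical operations are detected by name.

    # Merge each sink's ordered transform list into a tree whose shared
    # prefixes become shared nodes; branches appear where the sinks'
    # computations diverge.
    def build_overlay_tree(per_sink_ops):
        # per_sink_ops: {sink_name: [op1, op2, ...]}, ordered source to sink
        root = {"op": None, "children": {}, "sinks": []}
        for sink, ops in per_sink_ops.items():
            node = root
            for op in ops:
                node = node["children"].setdefault(
                    op, {"op": op, "children": {}, "sinks": []})
            node["sinks"].append(sink)  # leaf: data here feeds this sink
        return root

    tree = build_overlay_tree({
        "viewer":    ["select_atoms", "render_prep"],
        "histogram": ["select_atoms", "pair_distances", "bin_counts"],
    })
    # Both sinks share the "select_atoms" node; the paths split afterward.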
This project focuses on step four above. There are three challenges that must be addressed in generating the intermediate types and transforms:

1. If a later, but not the next, intermediate node needs a source element, this must be known so that the element can be propagated, allowing the later transform to operate.

2. Once a sink element has been generated, it must be propagated through each subsequent intermediate node and ultimately to the sink.

3. Since source and sink types may have identical type names, care must be taken to ensure no name collisions occur when propagating elements.

Each of these issues is addressed below. In order to evaluate the impact of these challenges, four types of node-to-node connections must be considered:

1. Source-to-Sink – This is the simplest case, with no intermediate nodes.

2. Source-to-Intermediate – In this case, some source elements may need to be propagated, but there are no generated sink elements to worry about.

3. Intermediate-to-Intermediate – This case requires the potential propagation of both source elements and sink elements.

4. Intermediate-to-Sink – This case only requires propagation of generated sink elements.

The first type of connection is the base, simple case. The default behavior of the underlying infrastructure used, PBIO/ECho [2], handles this for free. The rest of this paper focuses on the other three, more interesting, types of connections. There are several issues to be dealt with:

1. Identifying source elements needing to be propagated.

2. Identifying generated sink elements needing to be propagated.

3. Generating the proper intermediate data type description, considering the assigned transforms, the propagated source elements, and the propagated sink elements.

4. Generating the transformation operations to propagate the source and sink elements.

Propagating Source Elements

Identifying source elements for propagation is relatively straightforward. Each transformation identifies the source elements it needs in order to compute. Then, given the ordering of transformations on the path from the source to the sink, the union of the source elements required by all nodes downstream from the current node is computed. (A sketch of this computation follows this section.)

When propagating source elements, it is important to realize that they are propagated as part of the generated sink data type. To ensure safety, propagated type names must be adjusted to avoid conflicts with the sink data types. The approach used to accomplish this is twofold. First, a new node is introduced into the intermediate node's destination type as a root for the propagated source elements. Second, a unique string is prepended to the type names of the types referenced by the source elements. This means that each required source type is recreated in the sink type with its type name altered, and is populated with only the required elements. Accesses to source elements must be adjusted to include the proper prefix of this newly generated root element. In the unlikely event that a generated type name or the root source propagation element name conflicts with the sink type, the type names can all be recalculated without impacting the operation of the transforms. For the source element name, a new entry in the XML could be introduced to list what the name is. This approach was avoided at this time for simplicity and expediency.
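Below is a minimal sketch of the downstream-union computation and the renaming scheme just described, assuming the path is a simple chain of nodes and that each node's transforms declare the source elements they read. The prefix string used here is an invented placeholder; the paper does not give the actual value.

    # Source-element propagation on a linear path. Each node carries the
    # set of source elements its own transforms read; what a node must
    # forward is the union of the needs of everything downstream of it.
    SRC_PREFIX = "__src_"  # assumed unique prefix for recreated type names

    def elements_to_forward(path_needs):
        # path_needs: list of sets, one per node, ordered source -> sink;
        # returns, per node, the source elements it must pass along.
        forward = []
        downstream = set()
        for needs in reversed(path_needs):
            forward.append(set(downstream))  # what this node must emit
            downstream |= needs              # now include this node's reads
        forward.reverse()
        return forward

    def renamed(type_name):
        # Each propagated source type is recreated under the new root
        # element with a prefixed name, avoiding sink type name collisions.
        return SRC_PREFIX + type_name

    needs = [set(), {"atoms"}, {"atoms", "box"}, set()]  # per-node reads
    print(elements_to_forward(needs))
    # -> [{'atoms', 'box'}, {'atoms', 'box'}, set(), set()]
    print(renamed("MolecularFrame"))  # -> __src_MolecularFrame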
Propagating Generated Sink Elements

One of two approaches could be taken: either use an approach similar to the one employed for the source elements, but examining all of the upstream transforms, or look at the incoming type for the node and copy all of the elements that are not propagated source elements. The latter approach was selected. Unlike the source element case, where an element may or may not need to be propagated to the next node, once a sink element has been generated it must always be propagated to every subsequent node, including the final sink. Rather than examining all of the upstream nodes' transforms, it is easier to look at the incoming type of this node and copy all of the existing sink elements to the outgoing type. Any new elements generated by the assigned transforms can be added, as they are for any node in the graph. This ensures the destination elements will be propagated properly.

Generating Copy Transforms

In both propagation situations described above, knowing which elements to propagate is not enough. The transforms that actually copy the elements from the incoming type to the outgoing type must be generated. The best approach would use a block memory transfer and copy all of the elements in a single operation. In this first pass, an assignment transform is generated for each propagated element. (A sketch of this per-element generation appears after this section.) The transform generation code is sufficiently isolated to allow easy replacement with a more efficient approach later. This approach also does not preclude the procedure generating the executable transformation code from being more intelligent about the data copy operations, evaluating all of the transforms assigned to the node and performing block operations where appropriate.

Much like the source propagation, there are issues involving the hiding of the propagated source elements inside the sink type. If the generated transform is for a propagated source element, it has to be smart enough to know where the source element is currently stored and where to place it. It is important to determine when to add the proper prefix, based on whether the element is stored relative to the native root (the source-to-intermediate case) or in the propagated source element root (the intermediate-to-intermediate case). The sink elements could have had similar complications if they had been similarly packaged in a subelement. Eliminating this additional layer of complexity was chosen over the inherent consistency of having all intermediate forms consist of a separate source tree and sink tree.
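The first-pass generation of one assignment per propagated element might look like the following sketch. The emitted statements are simple field-copy pseudo-assignments rather than actual E-Code [4], the root element name is an invented placeholder, and the prefix handling follows the source-to-intermediate versus intermediate-to-intermediate distinction above.

    # First-pass copy-transform generation: one assignment per propagated
    # element. For propagated source elements, the incoming-side path
    # depends on the connection kind: a source-to-intermediate hop reads
    # relative to the native source root, while an intermediate-to-
    # intermediate hop reads from the propagated source element root.
    SRC_ROOT = "propagated_source"  # assumed root for propagated elements

    def copy_transforms(src_elems, sink_elems, from_source_node):
        stmts = []
        for name in sorted(src_elems):
            # where the element lives on the incoming side
            src = name if from_source_node else f"{SRC_ROOT}.{name}"
            stmts.append(f"output.{SRC_ROOT}.{name} = input.{src};")
        for name in sorted(sink_elems):
            # already-generated sink elements copy straight across
            stmts.append(f"output.{name} = input.{name};")
        return stmts

    print("\n".join(copy_transforms({"atoms"}, {"bins"},
                                    from_source_node=True)))
    # output.propagated_source.atoms = input.atoms;
    # output.bins = input.bins;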
Discussion

In its current demonstration state, the system is capable of taking an XML description of a source data type, a destination data type, the root types for each, and a list of transforms. From that description, it decomposes the transforms into individual lists for each node and generates the appropriate input and output types, along with the transforms needed to perform the calculation and to do any data propagation. For each node, the output type of the previous node is used as the input type of the next node. To ensure the data is in the proper memory layout at both ends, the original source description is used when interacting with the source, and similarly the original sink description is used with the sink nodes.

The next steps for this project are many. First, it is important to develop a good set of examples from different domains that demonstrate the potential of this work. Second, the daemons that handle the management of the creation of the overlay network on each node need to be completed. Last, it is important to perform experiments to measure the advantages this approach can deliver. Farther future work includes considering multiple sources, and even multiple sources with similar but not identical data types sharing the overlay network.

Previous work on this project produced the XML syntax, a compiler for the transforms that generates the E-Code language [4] used by PBIO/ECho for filters, and a prototype system and protocol for a client to send the XML description file to a server to set up a communication channel and begin data transmission. This prototype can be reworked slightly to produce the daemon.

Related Work

The Armada project [1] performs similar operations, but its focus is on taking existing decomposed computation and reordering it where possible to maximize data reduction close to the source and to delay data growth until as close to the sinks as possible. This project is also similar to dQUOB [2], but seeks a broader scope. dQUOB provides a way to break apart SQL operations and distribute the work along the network path from the source to the sink. It does not take multiple queries or multiple sinks into account when performing its optimization.

References

[1] R. Oldfield and D. Kotz. Armada: A parallel I/O framework for computational grids. Future Generation Computer Systems, 18(4):501-523, 2002.

[2] B. Plale and K. Schwan. dQUOB: Managing large data flows using dynamic embedded queries. In HPDC, pages 263-270, 2000.

[3] XML Schema. W3C. http://www.w3.org/XML/Schema.

[4] G. Eisenhauer. Dynamic Code Generation with the E-Code Language. Technical Report GIT-CC-02-42, Georgia Institute of Technology, College of Computing, July 2002.